Abstract

We describe a Gröbner basis of relations among conditional probabilities in a
discrete probability space, with any set of conditioned-upon events. They may be 
specialized to the partially-observed random variable case, the purely conditional 
case, and other special cases. We also investigate the connection to generalized 
permutohedra and describe a "conditional probability simplex." 

1 Relations among conditional probabilities 

In 1974, Julian Besag [I] discussed the "unobvious and highly restrictive consistency
conditions" among conditional probabilities. In this paper we give an answer in the
discrete case to the question: what conditions must a set of conditional probabilities
satisfy in order to be compatible with some joint distribution?

Let $\Omega = \{1, \dots, m\}$ be a finite set of singleton events, and let
$p = (p_1, \dots, p_m)$ be a probability distribution on them. Let $\mathcal{E}$ be a set of
observable events which will be conditioned on, each a set of at least 2 singleton events.
Then for events $I \subset J$, $J$ in $\mathcal{E}$, we can assign conditional probabilities
for the chance of $I$ given $J$, denoted $p_{I|J}$. Settling Besag's question then becomes a
matter of determining the relations that must hold among the quantities $p_{I|J}$. For
example, Besag gives the relation (see also [3]),

\[
\frac{P(x)}{P(y)} = \prod_{i=1}^{n}
\frac{P(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_n)}
     {P(y_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_n)}. \tag{1}
\]

Since there are in general infinitely many such relations, we would like to organize them
into an ideal and provide a nice basis for that ideal. A quick review of the language of
ideals, varieties, and Gröbner bases appears in Geiger et al. [Til p. 1471] and more detail
in Cox et al. [7]. In Theorem 3.2 we generalize relations such as (1) and Bayes' rule to
give a universal Gröbner basis of this ideal, a type of basis with useful algorithmic
properties.

The second result generalized in this paper is due to Matúš [15]. This states that the
space of conditional probability distributions $(p_{i|ij})$ conditioned on events of size two
maps homeomorphically onto the permutohedron. In Theorem 4.3 we generalize this result to
arbitrary sets $\mathcal{E}$ of conditioned-upon events. The resulting image is a generalized
permutohedron [20, 24]. This is a polytope which provides a canonical, conditional-probability
analog to the probability simplex under the correspondence provided by toric geometry
[23] and the theory of exponential families.

Work on the subject of relations among conditional probabilities has primarily focused
on the case where the events in $\mathcal{E}$ correspond to observing the states of a subset
of $n$ random variables. Arnold et al. [2] develop the theory for both discrete and
continuous random variables, particularly in the case of two random variables, and cast the
compatibility of two families of conditional distributions as a solution to a system of
linear equations. Slavković and Sullivant [22] consider the case of compatible full
conditionals, and compute related unimodular ideals.

This paper is organized as follows. In Section 2 we introduce some necessary definitions.
In Section 3 we give compatibility conditions in the general case of $m$ events in a
discrete probability space, with any set $\mathcal{E}$ of conditioned-upon events. These
conditions come in the form of a universal Gröbner basis, which makes them particularly
useful for computations: as a result, they may be specialized to the partially observed
random variable case, the purely conditional case, and other special cases simply by
changing $\mathcal{E}$. In [TH [T7], we have seen that permutohedra and generalized
permutohedra [20] play a central role in the geometry of conditional independence; the same
is true of conditional probability. The geometric results of Matúš [15] map the space of
conditional probability distributions (Definition 2.1) for all possible conditioned events
$\mathcal{E} = \{I \subset [m] : |I| \ge 2\}$ onto the permutohedron $P_{m-1}$. See Figure 1
for a diagram of the 3-dimensional permutohedron. In Section 4, we will discuss how to
extend this result to general $\mathcal{E}$, in which case we obtain generalized
permutohedra as the image. This will be accomplished using a version of the moment map of
toric geometry (Theorem 7.1). In Section 5 we discuss how to specialize our results to the
case of $n$ partially observed random variables, including as an example how to recover the
relation (1). Finally, in Section 6 we use this specialization to explain the relationship
of Bayes' rule to our constructions. In the Appendix we recall a few necessary facts about
toric varieties.

Figure 1: The permutohedron $P_4$.

2 Conditional probability distributions 

Let $\mathcal{E}$ be a collection of subsets $I$, with $|I| \ge 2$, of
$[m] = \Omega = \{1, \dots, m\}$. Let $\mathbb{C}[\mathcal{E}]$ denote the event algebra, the
polynomial ring with indeterminates $p_{i|I}$ for all $I \in \mathcal{E}$ and $i \in I$,
i.e. one unknown for each elementary conditional probability. Then we denote by

\[
\|\mathcal{E}\| = \sum_{I \in \mathcal{E}} |I|
\]

the number of variables of $\mathbb{C}[\mathcal{E}]$. We write $p_i$ for $p_{i|[m]}$ when
$[m] \in \mathcal{E}$. The unknowns of $\mathbb{C}[\mathcal{E}]$ are meant to represent
conditional probabilities, as we now explain. The set $\{1, \dots, m\}$ indexes the $m$
disjoint events, and a point $(p_1, \dots, p_m) \in \mathbb{R}^m_{\ge 0}$ with
$\sum_j p_j = 1$ represents a probability distribution on these events. When $p_j > 0$ for
all $j$, the conditional probability of event $i$ given event $I$ containing it is

\[
p_{i|I} = \frac{p_i}{\sum_{j \in I} p_j}. \tag{2}
\]

To extend this notion to the case $P(I) = \sum_{j \in I} p_j = 0$, and to be able to deal
with multiple conditioning sets, we make the following standard definition [3], considered
in this form by Matúš [15].

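Definition 2.1. A conditional probability distribution for $\mathcal{E}$ is a point
$(p_{i|I} : i \in I \in \mathcal{E}) \in \mathbb{R}^{\|\mathcal{E}\|}_{\ge 0}$ such that for
all $J, K \in \mathcal{E}$ with $J \subset K$,

(i) $\sum_{i \in J} p_{i|J} = 1$

(ii) for all $i \in J$, $p_{i|K} = p_{i|J} \sum_{j \in J} p_{j|K}$.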
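As a small illustration of Definition 2.1, the following sketch builds the conditional probabilities of equation (2) from a strictly positive distribution and checks conditions (i) and (ii) with exact rational arithmetic. The function names, the sample distribution, and the choice of events are assumptions made for this example only; they do not come from the paper.

```python
from fractions import Fraction

def conditional_distribution(p, events):
    """Return {(i, I): p_i / P(I)} for a strictly positive distribution p.

    p maps each singleton event to its probability; events is a list of
    frozensets, each with at least two elements (hypothetical encoding).
    """
    cond = {}
    for I in events:
        total = sum(p[j] for j in I)
        for i in I:
            cond[(i, I)] = p[i] / total
    return cond

def satisfies_definition(cond, events):
    """Check conditions (i) and (ii) of Definition 2.1 exactly."""
    for J in events:
        # (i): the conditional probabilities given J sum to one.
        if sum(cond[(i, J)] for i in J) != 1:
            return False
    for J in events:
        for K in events:
            if J < K:  # J is a proper subset of K
                mass = sum(cond[(j, K)] for j in J)
                # (ii): p_{i|K} = p_{i|J} * sum_{j in J} p_{j|K}
                if any(cond[(i, K)] != cond[(i, J)] * mass for i in J):
                    return False
    return True

p = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
events = [frozenset(s) for s in ({1, 2}, {2, 3}, {1, 2, 3})]
cond = conditional_distribution(p, events)
print(cond[(1, frozenset({1, 2}))])        # 3/5
print(satisfies_definition(cond, events))  # True
```

Exact fractions are used so that the equalities in (i) and (ii) can be tested without a floating-point tolerance.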

Observe that (ii) is a relative version of (2), as (2) follows from (ii) with $K = [m]$,
$J = I$, and $\sum_{i \in I} p_i \neq 0$. If on the other hand $\sum_{j \in J} p_{j|K} = 0$,
the whole probability simplex
$\Delta_J := \{(p_{j|J})_{j \in J} : p_{j|J} \ge 0, \sum_{j \in J} p_{j|J} = 1\}$ satisfies
the definition. This freedom is known in probability theory as versions of conditional
probability [5]. In algebraic geometry, this corresponds to the notion of a blow-up [13],
and the simplex $\Delta_J$ to the exceptional divisor. Before we give a homogenized version
of Definition 2.1, we consider the homogenized version of probability.

2.1 A projective view of probability 

Consider a probability space with $m$ disjoint atomic events $([m], 2^{[m]}, P)$. The space
of probability distributions $P$ on them is typically represented as a probability simplex,
where each $P(i)$ is a coordinate $p_i$ such that $p_i \ge 0$ and $\sum_i p_i = 1$. We will
be describing families of probability distributions in terms of algebraic varieties, and we
prefer to think of points $(p_1 : \cdots : p_m)$ as lying in complex projective space. This
is equivalent to letting $V = \mathbb{C}\{e_1, \dots, e_m\} \cong \mathbb{C}^m$ be the complex
vector space spanned by the outcomes (singleton events) and considering points
$p \in \mathbb{P}V$ as representing mixtures over outcomes or probability distributions.
There are two ways to match up the notion of the probability simplex with that of complex
projective space. One way to do so, restriction, identifies the probability simplex
$\Delta_{m-1}$ with the real, positive part of the affine open $\{\sum_i y_i \neq 0\}$ of the
$\mathbb{P}^{m-1}$ with homogeneous coordinates $(y_1 : y_2 : \cdots : y_m)$, as illustrated
in Figure 2.




Figure 2: Probability simplex in the projective plane 

Alternatively we can use projection, equivalent in the special case that
$(y_1 : \cdots : y_m) \in \Delta_{m-1}$, via the moment map (Theorem 7.1). The identity
matrix $A = I_m$ comprised of standard unit vectors $e_i$ defines the probability simplex
$\Delta_{m-1} = \mathrm{conv}(A)$. The toric variety $Y_A$ is then the projective space
$\mathbb{P}^{m-1}$ and the moment map is:

\[
\mu : \mathbb{P}^{m-1} \to \Delta_{m-1}, \qquad
\mu((y_1 : \cdots : y_m)) = \frac{1}{\sum_i |y_i|} \sum_i |y_i| \, e_i .
\]

The moment map $\mu$ is the identity map on the probability simplex, but allows us to
define a point on the probability simplex for more general points in complex projective
space. The fiber over any of these points is the torus $(S^1)^m$, a product of $m$ unit
circles, since
$\mu(y_1 : \cdots : y_m) = \mu(e^{i\theta_1} y_1 : \cdots : e^{i\theta_m} y_m)$. A similar
point of view appears in quantum physics; here
$V = \mathbb{C}\{x : x \text{ a classical state}\}$ is the Hilbert space representing quantum
state and the modified moment map
$\mu'(y) = \frac{1}{\sum_i |y_i|^2} \sum_i |y_i|^2 e_i$ defines the probability
of observing a classical state (singleton event) [18].
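The moment map on projective space can be sketched numerically as follows. The function name `moment_map` and the sample point are assumptions for illustration; the code only evaluates the formula $\mu(y) = \sum_i |y_i| e_i / \sum_i |y_i|$ and shows that rescaling the point and rotating each coordinate by its own phase does not change the image.

```python
import cmath

def moment_map(y):
    """Send homogeneous coordinates y (complex, not all zero) to the simplex.

    Implements mu(y) = sum_i |y_i| e_i / sum_i |y_i|.
    """
    weights = [abs(c) for c in y]
    total = sum(weights)
    return [w / total for w in weights]

y = [1 + 0j, 2 + 0j, 1 + 0j]
# Rotate every coordinate by its own phase and rescale the whole point by 5.
phases = [0.3, 1.7, -2.1]
rotated = [5 * cmath.exp(1j * t) * c for t, c in zip(phases, y)]

print(moment_map(y))        # [0.25, 0.5, 0.25]
print(moment_map(rotated))  # the same point, up to floating-point rounding
```

The second call lands on the same point of the simplex, which is the torus fiber described in the text.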

One interpretation of this freedom in the choice of phases is that it suggests there are
circumstances where allowing probabilities to be negative and even complex in intermediate
computations might be useful. This may seem odd, but it can be argued that negative
probabilities are already implicitly employed [9]. For example, characteristic function
methods implicitly write a density as a linear combination of basis functions with ranges
not restricted to $\mathbb{R}_{\ge 0}$. Even if we are uncomfortable with such
interpretations, the compactification and homogenization can simply be viewed as a
convenient algebraic trick to make it easy to determine the relations among conditional
probabilities we are ultimately interested in. Moreover, for most purposes $\mathbb{C}$ can
be replaced with $\mathbb{R}$ [11] as the base field for our ring, and these relations are
unchanged.

2.2 Homogeneous conditional probability 

Analogously to the projective version of probability in Section 2.1, where we replaced the
requirement that probabilities $p_1, \dots, p_m$ sum to one with viewing them as coordinates
of a point in projective space, we now define a multihomogeneous version of Definition 2.1.
Now, a conditional probability distribution is represented by a point in the product of
projective spaces. This product has one $\mathbb{P}^{|I|-1}$ for each event
$I \in \mathcal{E}$ which is conditioned upon, and each factor space $\mathbb{P}^{|I|-1}$ is
equipped with homogeneous coordinates $(p_{i_1|I} : \cdots : p_{i_{|I|}|I})$.
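Definition 2.2. A projective conditional probability distribution for $\mathcal{E}$ is a
point $p = ((p_{i_1|I} : \cdots : p_{i_{|I|}|I}), I \in \mathcal{E})$ inside
$\prod_{I \in \mathcal{E}} \mathbb{P}^{|I|-1}$ such that for all $J, K \in \mathcal{E}$ and
$i \in J \subset K$,

\[
\Big(\sum_{j \in J} p_{j|J}\Big) p_{i|K} = p_{i|J} \Big(\sum_{j \in J} p_{j|K}\Big).
\]

Definition 2.2 specifies the following ideal in the event algebra $\mathbb{C}[\mathcal{E}]$:

\[
J_{\mathcal{E}} = \Big\langle \Big(\sum_{j \in J} p_{j|J}\Big) p_{i|K}
 - p_{i|J} \Big(\sum_{j \in J} p_{j|K}\Big) \; : \; J, K \in \mathcal{E},\ i \in J \subset K
 \Big\rangle .
\]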

This ideal consists of all polynomial relations that a point $p = (p_{i|I})$ in
$\prod_{I \in \mathcal{E}} \mathbb{P}^{|I|-1}$ must satisfy to be a projective conditional
probability distribution. In particular, any honest conditional probability distribution
must satisfy these. If we denote by $\{e_I : I \in \mathcal{E}\}$ the standard basis of
$\mathbb{Z}^{|\mathcal{E}|}$, this ideal $J_{\mathcal{E}}$ is multihomogeneous with respect
to the grading $\deg(p_{i|I}) = e_I$ (see e.g. [TH] for more on such gradings). In what
follows, it will be convenient to abbreviate $p_{J|J} := \sum_{j \in J} p_{j|J}$. Thus
$p_{J|J}$ would be equal to 1 for honest distributions, by Definition 2.1, but here we
regard it as a linear form in $\mathbb{C}[\mathcal{E}]$. Let $\alpha_{\mathcal{E}}$ denote
the product $\prod_{i \in I \in \mathcal{E}} p_{i|I}$ of all of the $\|\mathcal{E}\|$
variables in $\mathbb{C}[\mathcal{E}]$, and let $\beta_{\mathcal{E}}$ denote the product
$\prod_{I \in \mathcal{E}} p_{I|I}$. The saturation $(I : f^\infty)$ of an ideal $I$ is the
ideal generated by all polynomials $g$ such that $f^m g \in I$ for some $m$ [23]. Now we
define the ideal $I_{\mathcal{E}}$, when $[m] \in \mathcal{E}$, by the saturation

\[
I_{\mathcal{E}} := (J_{\mathcal{E}} : (\alpha_{\mathcal{E}} \beta_{\mathcal{E}})^\infty).
\]

When $[m] \notin \mathcal{E}$, let $\mathcal{E}' = \mathcal{E} \cup \{[m]\}$ and set
$I_{\mathcal{E}} := I_{\mathcal{E}'} \cap \mathbb{C}[\mathcal{E}]$. The purpose of saturation
is to make sure the desired behavior occurs when some coordinates are zero; for example,
it is necessary to move between the conditional independence ideals [11] generated by
expressions $P(X = x, Y = y \mid Z = z) - P(X = x \mid Z = z) P(Y = y \mid Z = z)$ and by the
cross product differences $P(x, y, z) P(x', y', z) - P(x, y', z) P(x', y, z)$ algebraically,
without assuming anything about the positivity of the probabilities in question.

In the next section, we describe a matrix $A_{G(\mathcal{E})}$ such that $I_{\mathcal{E}}$
arises as the toric ideal (Section 7). Our first main result will be a universal Gröbner
basis for the toric ideal $I_{\mathcal{E}}$. Gröbner bases, particularly universal Gröbner
bases, have many algorithmic properties that make them a very complete description of an
ideal. Cox, Little, and O'Shea [7] give an accessible overview; see also [231 E2].

3 A universal Grobner basis for relations among con- 
ditional probabilities 

A Bayes binomial in $\mathbb{C}[\mathcal{E}]$ is a binomial relation of the form

\[
p_{i|K} \, p_{j|J} - p_{j|K} \, p_{i|J}
\]

for $i, j \in J \subset K$, with $J, K \in \mathcal{E}$. Let $I_{\mathrm{Bayes}(\mathcal{E})}$
denote the ideal they generate. Bayes binomials get their name because they come from
Bayes' rule; more explanation is given in Section 6.
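A quick way to see why Bayes binomials are natural is to enumerate them for a small family of events and evaluate them on conditional probabilities computed from a positive distribution. The sketch below does this; the helper names and the sample distribution are assumptions for illustration and are not part of the paper.

```python
from fractions import Fraction
from itertools import combinations

def bayes_binomials(events):
    """Yield index data (i, j, J, K) for p_{i|K} p_{j|J} - p_{j|K} p_{i|J}."""
    for J in events:
        for K in events:
            if J < K:  # J is a proper subset of K
                for i, j in combinations(sorted(J), 2):
                    yield i, j, J, K

def evaluate(binomial, cond):
    """Evaluate one Bayes binomial at the point cond = {(i, I): value}."""
    i, j, J, K = binomial
    return cond[(i, K)] * cond[(j, J)] - cond[(j, K)] * cond[(i, J)]

p = {1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(1, 10), 4: Fraction(4, 10)}
events = [frozenset(s) for s in ({1, 2}, {1, 2, 3}, {1, 2, 3, 4})]
# Conditional probabilities p_{i|I} = p_i / P(I) of a positive distribution.
cond = {(i, I): p[i] / sum(p[j] for j in I) for I in events for i in I}

binomials = list(bayes_binomials(events))
print(len(binomials))                                  # 5
print(all(evaluate(b, cond) == 0 for b in binomials))  # True
```

Each binomial vanishes because both monomials equal $p_i p_j / (P(J) P(K))$ on such a point.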

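Proposition 3.1. The ideal generated by the Bayes binomials contains $J_{\mathcal{E}}$ and
is contained in the saturation of $J_{\mathcal{E}}$ by the probabilities that would sum to
one (where again $\beta_{\mathcal{E}} = \prod_{I \in \mathcal{E}} p_{I|I}$):

\[
J_{\mathcal{E}} \subseteq I_{\mathrm{Bayes}(\mathcal{E})} \subseteq
(J_{\mathcal{E}} : (\beta_{\mathcal{E}})^\infty)
\]

and in particular, $I_{\mathrm{Bayes}(\mathcal{E})} \subseteq I_{\mathcal{E}}$.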

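Proof. Write $p_{J|K} := \sum_{j \in J} p_{j|K}$. The ideal $J_{\mathcal{E}}$ is generated by
the degree-2 polynomials $p_{J|J} p_{i|K} - p_{i|J} p_{J|K}$ for $J, K \in \mathcal{E}$ and
$i \in J \subset K$. For each $i, j \in J$, we have
$a = p_{j|J} (p_{J|J} p_{i|K} - p_{i|J} p_{J|K})$ and
$b = p_{i|J} (p_{J|J} p_{j|K} - p_{j|J} p_{J|K})$ in $J_{\mathcal{E}}$, so
$a - b = p_{J|J} (p_{j|J} p_{i|K} - p_{j|K} p_{i|J})$ is in $J_{\mathcal{E}}$ and
$I_{\mathrm{Bayes}(\mathcal{E})} \subseteq (J_{\mathcal{E}} : (\beta_{\mathcal{E}})^\infty)$.
For the first inclusion, if $p_{J|J} p_{i|K} - p_{i|J} p_{J|K}$ is a generator of
$J_{\mathcal{E}}$, we may write it as an element
$\sum_{j \in J} (p_{i|K} p_{j|J} - p_{j|K} p_{i|J})$ of $I_{\mathrm{Bayes}(\mathcal{E})}$. □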

Our universal Gröbner basis of $I_{\mathcal{E}}$ will be given combinatorially by the cycles
of a labeled bipartite graph $G(\mathcal{E})$, defined as follows:

Vertices: one vertex $u_I$ for each $I \in \mathcal{E}$ and one vertex $v_i$ for each
$i \in \bigcup_{I \in \mathcal{E}} I$

Edges: a directed edge $u_I \to v_i$ for each $I \in \mathcal{E}$ and $i \in I$

Edge Labels: the edge $u_I \to v_i$ is labeled with the indeterminate $p_{i|I}$

For example, with $n = 4$, the labeled graph $G$ for
$\mathcal{E} = \{\{1, 2\}, \{1, 2, 3\}, \{1, 2, 3, 4\}\}$ is shown in Figure 3. Each oriented
cycle $C$ in the undirected version of $G$ defines a binomial $f_C$ as follows: each edge
label is on the positive side of the binomial if its edge is directed with the cycle, and on
the negative side if against. For example, in the graph in Figure 3, consider the cycle
$(1234, 3, 123, 1, 1234)$. The edges $p_3$ and $p_{1|123}$ are directed with the cycle and
the edges $p_{3|123}$ and $p_1$ are directed against, so the corresponding binomial is
$p_3 p_{1|123} - p_{3|123} p_1$. For a higher degree example, with $n = 3$ and
$\mathcal{E} = \{\{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$, we get
$p_{1|12} p_{3|13} p_{2|23} - p_{2|12} p_{3|23} p_{1|13}$ from the outer cycle, as shown in
Figure 4. A cycle is induced if it has no chord.

Figure 3: Bipartite graph for $\mathcal{E} = \{\{1, 2\}, \{1, 2, 3\}, \{1, 2, 3, 4\}\}$.

Figure 4: Outer cycle of the bipartite graph for
$\mathcal{E} = \{\{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$.
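The rule for reading a binomial off an oriented cycle can be written out as a short sketch. The encoding below (events as frozensets, singleton vertices as integers, an edge label as a pair `(i, I)`) and the function names are assumptions chosen for this example.

```python
def cycle_binomial(cycle):
    """Return (positive, negative) edge labels of the cycle binomial f_C.

    `cycle` is a closed walk alternating between event vertices (frozensets)
    and singleton vertices (ints), e.g. [K, i, J, j, K]. Every edge of the
    graph points from an event I to an element i of I and has label (i, I).
    """
    positive, negative = [], []
    for a, b in zip(cycle, cycle[1:]):
        if isinstance(a, frozenset):
            # Step from event to element: traversed with the edge direction.
            if b not in a:
                raise ValueError("no edge between %r and %r" % (a, b))
            positive.append((b, a))
        else:
            # Step from element to event: traversed against the edge direction.
            if a not in b:
                raise ValueError("no edge between %r and %r" % (a, b))
            negative.append((a, b))
    return positive, negative

def label(edge):
    """Format the label (i, I) as the string p_{i|I}."""
    i, I = edge
    return "p_{%d|%s}" % (i, "".join(str(k) for k in sorted(I)))

def show(cycle):
    positive, negative = cycle_binomial(cycle)
    return " ".join(map(label, positive)) + " - " + " ".join(map(label, negative))

E12, E13, E23 = frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})
E123, E1234 = frozenset({1, 2, 3}), frozenset({1, 2, 3, 4})

print(show([E1234, 3, E123, 1, E1234]))
# p_{3|1234} p_{1|123} - p_{3|123} p_{1|1234}
print(show([E12, 1, E13, 3, E23, 2, E12]))
# p_{1|12} p_{3|13} p_{2|23} - p_{1|13} p_{3|23} p_{2|12}
```

With $p_i = p_{i|1234}$, the first line is the binomial $p_3 p_{1|123} - p_{3|123} p_1$ of the text, and the second is the outer-cycle binomial up to the order of its factors.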

Theorem 3.2. The binomials defined by the cycles of $G(\mathcal{E})$ give a universal
Gröbner basis for $I_{\mathcal{E}}$. Moreover, $I_{\mathcal{E}}$ is generated by the induced
cycle binomials, though not necessarily as a Gröbner basis.

In order to prove Theorem 3.2, we first need to recall some facts about unimodular
toric ideals, of which $I_{\mathcal{E}}$ is an example. Unimodular matrices and unimodular
toric ideals are defined and characterized as follows, following Sturmfels [23]. A
triangulation of $A$ is a collection $\mathcal{T}$ of subsets $B$ of the columns of $A$ such
that $\{\mathrm{pos}(B) : B \in \mathcal{T}\}$ is the set of cones in a simplicial fan with
support $\mathrm{pos}(A)$. A triangulation of $A$ is unimodular if the normalized volume [23]
is equal to one for all maximal simplices $B$ in the triangulation. The matrix $A$ is a
unimodular matrix if all triangulations of $A$ are unimodular. We define a unimodular toric
ideal in the following definition-proposition.

Proposition 3.3. [23] A toric ideal $I_A$ is called unimodular if any of the following
equivalent conditions hold.

(i) Every reduced Gröbner basis of $I_A$ consists of squarefree binomials,

(ii) $A$ is a unimodular matrix,

(iii) all the initial ideals of $I_A$ are squarefree.

A special class of unimodular matrices are those coming from bipartite graphs [U [22].
Let $G = (U, V, E)$ be a bipartite graph. In our case, $G(\mathcal{E})$ has

\[
U = \{u_I : I \in \mathcal{E}\} \quad \text{and} \quad
V = \{v_i : i \in \textstyle\bigcup_{I \in \mathcal{E}} I\}. \tag{3}
\]

Let $A$ be the vertex-edge incidence matrix of $G$: the rows of $A$ are labeled
$u_1, \dots, u_{|U|}, v_1, \dots, v_{|V|}$, the columns are labeled with the edges, and
$a_{ij}$ is 1 if vertex $i$ is in edge $j$ and zero otherwise. For a cycle $C$ in the graph,
the cycle binomial $f_C$ is defined (up to sign) as above. Let $\pi_A$ be the map
$\mathbb{R}^{\|\mathcal{E}\|} \to \mathbb{R}^{|U| + |V|}$ defined by applying $A$. We say
$u \in \ker(\pi_A)$ is a circuit if $\mathrm{supp}(u)$ is minimal with respect to inclusion
in $\ker(\pi_A)$ and the coordinates of $u$ are relatively prime [22]. Equivalently, a
circuit is an irreducible binomial $x^{u^+} - x^{u^-}$ of the toric ideal $I_A$ with minimal
support. The Graver basis of the ideal $I_A$ consists of all circuits. For $A$ from a
bipartite graph, the circuits of $A$ are precisely the cycle binomials of the graph
[21, 22]. Additionally, a Graver basis is also a universal Gröbner basis in the case of
unimodular toric varieties (Proposition 8.11 of [23]). We summarize these results in the
following proposition.

Proposition 3.4. The vertex-edge incidence matrix $A$ of a bipartite graph $G = (U, V, E)$
is unimodular, so $I_A$ is a unimodular toric ideal. The cycle binomials of $G$ are the
circuits of $A$, and therefore define the Graver basis of $I_A$. In particular, they give a
universal Gröbner basis for $I_A$.

Now we are able to prove our theorem.

Proof of Theorem 3.2. Let $A_{G(\mathcal{E})}$ be the vertex-edge incidence matrix of
$G(\mathcal{E})$. By Proposition 3.4, its cycle binomials (circuits) give a universal
Gröbner basis of $I_{A_{G(\mathcal{E})}}$. In fact, the induced cycles are enough to generate
this ideal [Ij. Suppose $C$ is a cycle and $e$ a chord, and split $C$ into two cycles $C_1$
and $C_2$, both containing $e$ (but in opposite directions). Associate cycle binomials
$f_{C_1}$ and $f_{C_2}$, respectively. Then the S-polynomial (JTJ) with the $e$-containing
terms leading is $f_C$. However, the induced cycle binomials no longer necessarily form a
Gröbner basis. For example, let $\mathcal{E} = \{\{1, 2\}, \{2, 3\}, \{1, 2, 3\}\}$ as in
Figure 5.

Figure 5: Bipartite graph for $\mathcal{E} = \{\{1, 2\}, \{2, 3\}, \{1, 2, 3\}\}$.

The outer cycle $C = 1 \leftarrow 12 \to 2 \leftarrow 23 \to 3 \leftarrow 123 \to 1$ gives the
cycle binomial $f_C = p_{1|12} p_{2|23} p_{3|123} - p_{2|12} p_{3|23} p_{1|123}$. The cycle
$C$ has a chord $2 - 123$, and the binomial $f_C$ lies in the ideal of the two binomials

\[
p_{1|12} p_{2|123} - p_{2|12} p_{1|123} \quad \text{and} \quad
p_{2|23} p_{3|123} - p_{3|23} p_{2|123}
\]

obtained after splitting along the chord. These are both the induced cycles of the graph.
However, for a term order prioritizing $p_{2|123}$ (e.g. lexicographic with
$p_{2|123} \succ \cdots$), the leading term of $f_C$ cannot lie in the initial ideal
$\langle p_{1|12} p_{2|123},\ p_{3|23} p_{2|123} \rangle$ of the ideal generated by the
chordal binomials.
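The claim that $f_C$ lies in the ideal of the two chordal binomials can be checked by hand, and the following sketch does the bookkeeping with a minimal polynomial representation (a dictionary from sorted tuples of variable names to integer coefficients). The helper functions and the variable naming scheme such as `"p2|123"` are assumptions made for this example; the multipliers $p_{3|23}$ and $p_{1|12}$ are one explicit choice that works.

```python
from collections import defaultdict

def poly(*terms):
    """Build a polynomial from (coefficient, [variable names]) terms."""
    result = defaultdict(int)
    for coeff, variables in terms:
        result[tuple(sorted(variables))] += coeff
    return {m: c for m, c in result.items() if c != 0}

def mul(f, g):
    """Multiply two polynomials."""
    result = defaultdict(int)
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            result[tuple(sorted(m1 + m2))] += c1 * c2
    return {m: c for m, c in result.items() if c != 0}

def add(f, g):
    """Add two polynomials, dropping cancelled terms."""
    result = defaultdict(int, f)
    for m, c in g.items():
        result[m] += c
    return {m: c for m, c in result.items() if c != 0}

# The two induced-cycle (chordal) binomials and the outer-cycle binomial.
g1 = poly((1, ["p1|12", "p2|123"]), (-1, ["p2|12", "p1|123"]))
g2 = poly((1, ["p2|23", "p3|123"]), (-1, ["p3|23", "p2|123"]))
f_C = poly((1, ["p1|12", "p2|23", "p3|123"]), (-1, ["p2|12", "p3|23", "p1|123"]))

# f_C = p_{3|23} * g1 + p_{1|12} * g2: the terms containing p_{2|123} cancel.
combo = add(mul(poly((1, ["p3|23"])), g1), mul(poly((1, ["p1|12"])), g2))
print(combo == f_C)                            # True
print(any("p2|123" in m for m in f_C))         # False
```

The second check shows that neither monomial of $f_C$ involves $p_{2|123}$, so neither is divisible by $p_{1|12} p_{2|123}$ or $p_{3|23} p_{2|123}$.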

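Next we show that the graph ideal and conditional probability ideal coincide,
$I_{A_{G(\mathcal{E})}} = I_{\mathcal{E}}$. For the containment
$I_{A_{G(\mathcal{E})}} \supseteq I_{\mathcal{E}}$, first observe that
$I_{\mathrm{Bayes}(\mathcal{E})} \subseteq I_{A_{G(\mathcal{E})}}$. This is because if
$J, K \in \mathcal{E}$ with $i, j \in J \subset K$, we have the subgraph in Figure 6, which
is a cycle with associated cycle binomial $p_{j|J} p_{i|K} - p_{i|J} p_{j|K}$. Together with
Proposition 3.1 we now have

\[
J_{\mathcal{E}} \subseteq I_{\mathrm{Bayes}(\mathcal{E})} \subseteq I_{A_{G(\mathcal{E})}}
\]

so, since saturation is inclusion-preserving and $I_{A_{G(\mathcal{E})}}$ is prime,

\[
I_{\mathcal{E}} = (J_{\mathcal{E}} : (\alpha_{\mathcal{E}} \beta_{\mathcal{E}})^\infty)
\subseteq (I_{A_{G(\mathcal{E})}} : (\alpha_{\mathcal{E}} \beta_{\mathcal{E}})^\infty)
= I_{A_{G(\mathcal{E})}} .
\]

Figure 6: Subgraph of $G(\mathcal{E})$ giving a Bayes binomial.

Now we show the reverse inclusion $I_{A_{G(\mathcal{E})}} \subseteq I_{\mathcal{E}}$. Again by
Proposition 3.1 we have

\[
I_{\mathrm{Bayes}(\mathcal{E})} \subseteq I_{\mathcal{E}} .
\]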

Now assume that $[m] \in \mathcal{E}$, so that $p_1, \dots, p_m \in \mathbb{C}[\mathcal{E}]$.
We claim that in fact
$I_{A_{G(\mathcal{E})}} \subseteq (I_{\mathrm{Bayes}(\mathcal{E})} : \prod_{i=1}^m p_i)$, from
which the result will follow. Let $C$ be an induced cycle of $G(\mathcal{E})$, and $f_C$ its
cycle binomial. We must show that this cycle binomial can be obtained from the Bayes
binomials, up to multiplication by $\prod_{i=1}^m p_i$. Let $C$ be the cycle

\[
i_1 \leftarrow J_1 \to i_2 \leftarrow J_2 \to \cdots \to i_k \leftarrow J_k \to i_1 .
\]

With this notation we have $i_1, i_2 \in J_1$, $i_2, i_3 \in J_2$, \dots, $i_1, i_k \in J_k$.
Then

\[
f_C = p_{i_2|J_1} p_{i_3|J_2} \cdots p_{i_k|J_{k-1}} p_{i_1|J_k}
    - p_{i_1|J_1} p_{i_2|J_2} \cdots p_{i_k|J_k} .
\]

We show the first monomial of $(\prod_{i=1}^m p_i) f_C$ is equal to the second mod
$I_{\mathrm{Bayes}(\mathcal{E})}$. Pair off as follows:
as follows: 

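\begin{align*}
 & (p_{i_1} p_{i_2} p_{i_3} \cdots p_{i_k}) \, p_{i_2|J_1} p_{i_3|J_2} p_{i_4|J_3} \cdots p_{i_k|J_{k-1}} p_{i_1|J_k} && \text{Step } 1 \\
=\ & (p_{i_2} p_{i_2} p_{i_3} \cdots p_{i_k}) \, p_{i_1|J_1} p_{i_3|J_2} p_{i_4|J_3} \cdots p_{i_k|J_{k-1}} p_{i_1|J_k} && \text{Step } 2 \\
=\ & (p_{i_2} p_{i_3} p_{i_3} \cdots p_{i_k}) \, p_{i_1|J_1} p_{i_2|J_2} p_{i_4|J_3} \cdots p_{i_k|J_{k-1}} p_{i_1|J_k} && \text{Step } 3
\end{align*}

where the equalities hold mod $I_{\mathrm{Bayes}(\mathcal{E})}$. Continuing in this fashion,
at step $k - 1$ we have

\begin{align*}
=\ & (p_{i_2} p_{i_3} \cdots p_{i_{k-1}} p_{i_{k-1}} p_{i_k}) \, p_{i_1|J_1} p_{i_2|J_2} \cdots p_{i_{k-2}|J_{k-2}} \, p_{i_k|J_{k-1}} p_{i_1|J_k} && \text{Step } k-1 \\
=\ & (p_{i_2} p_{i_3} \cdots p_{i_{k-1}} p_{i_k} p_{i_k}) \, p_{i_1|J_1} p_{i_2|J_2} \cdots p_{i_{k-2}|J_{k-2}} \, p_{i_{k-1}|J_{k-1}} p_{i_1|J_k} && \text{Step } k \\
=\ & (p_{i_2} p_{i_3} \cdots p_{i_{k-1}} p_{i_k} p_{i_1}) \, p_{i_1|J_1} p_{i_2|J_2} \cdots p_{i_{k-2}|J_{k-2}} \, p_{i_{k-1}|J_{k-1}} p_{i_k|J_k} && \text{Step } k+1
\end{align*}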

as desired. In terms of $G(\mathcal{E})$, this amounts to breaking up a long cycle into
4-cycles passing through $[m]$, and erasing the overlaps among these cycles. Thus since the
induced cycles generate $I_{A_{G(\mathcal{E})}}$, we have

\[
I_{A_{G(\mathcal{E})}} \subseteq \Big(I_{\mathrm{Bayes}(\mathcal{E})} : \prod_{i=1}^m p_i\Big)
\subseteq I_{\mathcal{E}} .
\]

This proves the result in the special case $[m] \in \mathcal{E}$. In the general case,
suppose we have some $\mathcal{E}$ not containing $[m]$, enabling us to obtain relations
among 'pure' conditional probabilities (i.e. excluding $p_1, \dots, p_m$). Let
$\mathcal{E}' = \mathcal{E} \cup \{[m]\}$ and apply the special case of the Theorem. Then by
[23, Proposition 4.13(c)], since we have a universal Gröbner basis, we just intersect it
with the smaller coordinate ring to obtain a universal Gröbner basis in the smaller ring.
This corresponds here to removing the set $[m]$ from $\mathcal{E}'$ and taking the remaining
cycle binomials as our new Gröbner basis. □
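The pairing-off steps in the proof above can be replayed mechanically on a small cycle. In the sketch below a monomial is a multiset of factors, and `bayes_swap` applies one rewrite $p_a \, p_{b|J} \to p_b \, p_{a|J}$, which is valid modulo the Bayes binomials when $a, b \in J$ and $[m] \in \mathcal{E}$. The encoding, the function names, and the choice of the 3-cycle are assumptions for illustration.

```python
from collections import Counter

def bayes_swap(monomial, a, b, J):
    """Rewrite p_a * p_{b|J} as p_b * p_{a|J} inside a monomial.

    Factors are encoded as ("p", i) for p_i and (i, J) for p_{i|J}.
    """
    if a not in J or b not in J:
        raise ValueError("both indices must lie in J")
    out = Counter(monomial)
    for factor in (("p", a), (b, J)):
        if out[factor] == 0:
            raise ValueError("factor %r not present" % (factor,))
        out[factor] -= 1
    out[("p", b)] += 1
    out[(a, J)] += 1
    return +out  # unary plus drops factors whose count fell to zero

def first_monomial(i, J):
    """(prod p_i) * p_{i_2|J_1} p_{i_3|J_2} ... p_{i_1|J_k}."""
    k = len(i)
    m = Counter(("p", a) for a in i)
    m.update((i[(s + 1) % k], J[s]) for s in range(k))
    return m

def second_monomial(i, J):
    """(prod p_i) * p_{i_1|J_1} p_{i_2|J_2} ... p_{i_k|J_k}."""
    m = Counter(("p", a) for a in i)
    m.update((i[s], J[s]) for s in range(len(i)))
    return m

# Cycle 1 <- {1,2} -> 2 <- {2,3} -> 3 <- {1,3} -> 1, with [m] = {1,2,3}.
i = [1, 2, 3]
J = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]

monomial = first_monomial(i, J)
for s in range(len(i)):
    # Step s+1 -> s+2 of the proof: trade p_{i_s} p_{i_{s+1}|J_s}.
    monomial = bayes_swap(monomial, i[s], i[(s + 1) % len(i)], J[s])

print(monomial == second_monomial(i, J))  # True
```

After $k$ swaps the first monomial has been turned into the second, which is the content of the chain of equalities.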

4 Conditional probability and the moment map 

In this section we show how to recover and generalize some results of Matúš [15] using
toric geometry. The main result we will expand upon maps the space of conditional
probability distributions (Definition 2.1) for all possible conditioned events
$\mathcal{E} = \{I \subset [m] : |I| \ge 2\}$ onto the permutohedron by first projecting down
to events of size 2, $\mathcal{E} = \{I \subset [m] : |I| = 2\}$.

Theorem 4.1 (Matúš [15]). For $\mathcal{E} = \{I \subset [m] : |I| \ge 2\}$ and $p$ a
conditional probability distribution (Definition 2.1), the map
$W : \mathbb{R}^{\|\mathcal{E}\|} \to \mathbb{R}^m$, given by

\[
W_i(p) = \sum_{j \in [m] \setminus i} p_{i|ij}
\]

restricts to a homeomorphism of the space of conditional probabilities onto the
$(m - 1)$-dimensional permutohedron $P_{m-1}$.
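To make the map $W$ concrete, the sketch below evaluates it on the conditional probabilities $p_{i|ij} = p_i / (p_i + p_j)$ of a positive distribution and tests membership in the permutohedron. The membership test uses the standard majorization description of the convex hull of the permutations of $(m-1, \dots, 1, 0)$; the function names and the sample distribution are assumptions for this example.

```python
from fractions import Fraction

def matus_map(p):
    """W_i(p) = sum over j != i of p_{i|ij}, with p_{i|ij} = p_i / (p_i + p_j)."""
    idx = sorted(p)
    return [sum(p[i] / (p[i] + p[j]) for j in idx if j != i) for i in idx]

def in_permutohedron(w):
    """Test the inequalities defining conv of permutations of (m-1, ..., 1, 0)."""
    m = len(w)
    if sum(w) != Fraction(m * (m - 1), 2):
        return False
    ordered = sorted(w, reverse=True)
    # The k largest coordinates may sum to at most (m-1) + ... + (m-k).
    return all(sum(ordered[:k]) <= sum(range(m - k, m)) for k in range(1, m))

p = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
w = matus_map(p)
print([str(c) for c in w])   # ['27/20', '16/15', '7/12']
print(in_permutohedron(w))   # True
```

For $m = 3$ the coordinates sum to 3, matching the hexagon with vertices the permutations of $(2, 1, 0)$.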

Note that the linear map $W$ is the restriction of $A = A_{G(\mathcal{E})}$ to the rows
labeled by the vertex set $V$ in $G$ (3) and to the columns labeled by two-event conditional
probabilities (edges in $G(\mathcal{E})$) $p_{i|ij}$. In fact $A$ will in general define a
map from the space of projective conditional probability distributions onto a generalized
permutohedron $\Delta_{\mathcal{E}}$ defined below.

First consider the multiprojective toric variety $Z_A$ cut out of
$\prod_{I \in \mathcal{E}} \mathbb{P}^{|I|-1}$ by the equations of Theorem 3.2, i.e. the
space of projective conditional probability distributions. In Section 7 we recall the
definition of the affine toric variety $X_A$ associated to an integer matrix $A$, and the
projective toric variety $Y_A$ associated to a $\mathbb{Z}$-graded matrix $A$ (that is, a
matrix $A$ such that $(1, 1, \dots, 1)$ lies in its rowspan). Given a matrix
$A = A_{G(\mathcal{E})}$, the space of $\mathcal{E}$-projective conditional probability
distributions $Z_A$ is the closure of the image of the map $\theta \mapsto \theta^A$, viewed
as an element of $\prod_{I \in \mathcal{E}} \mathbb{P}^{|I|-1}$. Equipping this product space
with multihomogeneous coordinates
$((p_{i_1|I} : \cdots : p_{i_{|I|}|I}), I \in \mathcal{E})$, the variety $Z_A$ is cut out by
the (multihomogeneous) toric ideal $I_A$. Suppose that we have
$\bigcup_{I \in \mathcal{E}} I = [m]$. Then because we view the points
$((p_{i_1|I} : \cdots : p_{i_{|I|}|I}), I \in \mathcal{E})$ as elements of
$\prod_{I \in \mathcal{E}} \mathbb{P}^{|I|-1}$, the dimension of this variety is $m - 1$ as
expected, though the rank of $A$ is larger.

We now develop a version of the moment map of toric geometry applicable to the
variety of projective conditional probability distributions. Hereafter we index the columns
of $A$ by the conditional probability they represent, i.e.
$A = (a_{i|I} : i \in I \in \mathcal{E})$. We will require a multigraded notion to play the
role of the convex hull $\mathrm{conv}(A)$ in the moment map. We define

\[
\mathrm{mconv}(A) = \Big\{ \sum_{I \in \mathcal{E}} \sum_{i \in I} \lambda_{i|I} \, a_{i|I}
 \; : \; \lambda_{i|I} \in \mathbb{R}_{\ge 0},\ \sum_{i \in I} \lambda_{i|I} = 1 \text{ for each } I \in \mathcal{E} \Big\}.
\]

A function $w : 2^{[n]} \to \mathbb{R}$ is called submodular if
$w(I) + w(J) \ge w(I \cap J) + w(I \cup J)$ for $I, J \subset [n]$. Each subset $I$ of $[m]$
defines a submodular function $w_I$ on $2^{[n]}$ by setting $w_I(J) = 1$ if $I \cap J$ is
non-empty and $w_I(J) = 0$ if $I \cap J$ is empty, for $J \in 2^{[n]}$. The function $w$
defines a convex polytope $Q_w$ of dimension $\le n - 1$ as follows:

\[
Q_w := \Big\{ x \in \mathbb{R}^n : x_1 + x_2 + \cdots + x_n = w([n])
\text{ and } \sum_{i \in I} x_i \le w(I) \text{ for all } \emptyset \neq I \subseteq [n] \Big\}.
\]

Thus the polytope corresponding to a subset $I$ is the simplex
$\Delta_I = \mathrm{conv}\{e_k : k \in I\}$. Now consider an arbitrary subset
$\mathcal{E} = \{I_1, I_2, \dots, I_r\}$ of $2^{[m]}$. It defines the submodular function
$w_{\mathcal{E}} = w_{I_1} + w_{I_2} + \cdots + w_{I_r}$. The corresponding polytope
$Q_{w_{\mathcal{E}}}$ is now the Minkowski sum [24]

\[
\Delta_{\mathcal{E}} = \Delta_{I_1} + \Delta_{I_2} + \cdots + \Delta_{I_r}. \tag{4}
\]
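The submodular function $w_{\mathcal{E}}$ simply counts how many sets of $\mathcal{E}$ meet a given subset, which makes it easy to tabulate. The sketch below computes it for a small example and verifies the submodularity inequality by brute force; the function names and the example family are assumptions for illustration.

```python
from itertools import combinations

def subsets(n):
    """Yield every subset of {1, ..., n} as a frozenset."""
    ground = range(1, n + 1)
    for r in range(n + 1):
        for c in combinations(ground, r):
            yield frozenset(c)

def w_E(events, J):
    """w_E(J) = sum of w_I(J) = number of sets I in E with I & J non-empty."""
    return sum(1 for I in events if I & J)

def is_submodular(w, n):
    """Check w(I) + w(J) >= w(I & J) + w(I | J) for all pairs of subsets."""
    all_subsets = list(subsets(n))
    return all(w(I) + w(J) >= w(I & J) + w(I | J)
               for I in all_subsets for J in all_subsets)

events = [frozenset(s) for s in ({1, 2}, {1, 3}, {2, 3}, {1, 2, 3})]

def w(J):
    return w_E(events, J)

print(w(frozenset({1})), w(frozenset({1, 2})), w(frozenset({1, 2, 3})))  # 3 4 4
print(is_submodular(w, 3))  # True
```

The values 3, 4, 4 give the inequalities $x_1 \le 3$, $x_1 + x_2 \le 4$ and the equation $x_1 + x_2 + x_3 = 4$, which are tight at the point $(3, 1, 0)$.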

Proposition 4.2. The projection of $\mathrm{mconv}(A_{G(\mathcal{E})})$ to the
$V$-coordinates (3) is $\Delta_{\mathcal{E}}$.

Proof. The mconv construction is equivalent to translating each simplex that is the convex
hull of each set of vectors $A_I \subset A$ by setting its $U$-coordinates (3) all to 1, then
taking the Minkowski sum. □

Next is a version of Theorem 7.1 for varieties $Z_A$. Note that $|V| = m$ when
$\bigcup_{I \in \mathcal{E}} I = [m]$. Now we have a separate partition function for each
conditioned-upon set.

Theorem 4.3. For $A = A_{G(\mathcal{E})}$, the map $\nu : Z_A \to \mathrm{mconv}(A)$ defined
by

\[
\nu(z) = \sum_{I \in \mathcal{E}} \frac{1}{Z_I} \sum_{i \in I} |z_{i|I}| \, a_{i|I},
\]

where $Z_I = \sum_{i \in I} |z_{i|I}|$, maps $Z_A$ onto $\mathrm{mconv}(A)$, and is a
bijection on $Z_{A, \ge 0}$.

Proof. The map $\nu$ is the composition of two maps. The first map,
$\nu_1 : Z_A \to \prod_{I \in \mathcal{E}} \Delta_I$, is a product of maps $\mu_I$
corresponding to each submatrix $A_I$ as in the proof of Theorem 7.1. It sends a point
$((z_{i_1|I}, \dots, z_{i_{|I|}|I}), I \in \mathcal{E}) \in Z_A$ to the point
$p = (p_{i|I} = \frac{1}{Z_I} |z_{i|I}| : i \in I \in \mathcal{E})$ in the product of
simplices $\prod_{I \in \mathcal{E}} \Delta_I$, which can be thought of as possibly redundant
barycentric coordinates. The second map, $\nu_2$, corresponds to the Minkowski sum, with
$\nu_2 : \prod_{I \in \mathcal{E}} \Delta_I \to \mathrm{mconv}(A)$ sending $p$ to $Ap$.
Whereas in the simplex case (and for a single $A_I$) in Theorem 7.1, $\mu_1$ and $\mu_2$ are
identities, here there is additional ambiguity introduced by the Minkowski sum. In
particular, let $b \in \Delta_{\mathcal{E}}$ (4). Then the preimage of $b$ in
$\prod_{I \in \mathcal{E}} \Delta_I$ is

\[
P_A(b) = \{p : Ap = b\} \cap \prod_{I \in \mathcal{E}} \Delta_I,
\]

and in general consists of a polytope. This is illustrated in Figure 7, where the polytope
$P_A(b)$ is the set of pairs of points in the first and second simplex that add to $b$.
Analogously to the one-factor case (Theorem 7.1), we will choose among the points of this
fiber by selecting the maximum entropy point (or the point closest in the KL-divergence
sense to the point representing a uniform distribution in all simplices). The resulting
space of solutions (the space of conditional probability distributions) is illustrated in
Figure 8. Setting $D(p) = D(p \,\|\, p_{\mathrm{uniform}})$, so

\[
D(p) = \sum_{i \in I \in \mathcal{E}} p_{i|I} \log p_{i|I}
     - \sum_{i \in I \in \mathcal{E}} p_{i|I} \log \frac{1}{|I|},
\]

the Hessian of $D$ is $\frac{1}{p_{i|I}}$ on the diagonal and zero elsewhere. Thus it is
positive definite on the interior of $\prod_{I \in \mathcal{E}} \Delta_I$, and on points of
the relative interior after restricting to nonzero coordinates. Thus $D$, restricted to the
fiber $P_A(b)$, has a unique minimum $p^*$ on $\prod_{I \in \mathcal{E}} \Delta_I$. Were
there another minimum, the (possibly restricted) Hessian would be positive definite on the
open segment connecting it with $p^*$. We now argue that $p^* \in Z_A$.

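First suppose $p^* \in (\prod_{I \in \mathcal{E}} \Delta_I)^\circ$, so that
$0 < p_{i|I} < 1$ in all coordinates, and let $u \in \ker A$. We must show that
$p^{u^+} = p^{u^-}$. For small $t$, $p^* + tu \in \prod_{I \in \mathcal{E}} \Delta_I$ and

\[
D(p^* + tu) = \sum_{i \in I \in \mathcal{E}} (p_{i|I} + t u_{i|I}) \log(p_{i|I} + t u_{i|I})
 - \sum_{i \in I \in \mathcal{E}} (p_{i|I} + t u_{i|I}) \log \frac{1}{|I|},
\]

\[
\frac{dD}{dt} = \sum_{i \in I \in \mathcal{E}} u_{i|I} \log(p_{i|I} + t u_{i|I})
 + \sum_{i \in I \in \mathcal{E}} u_{i|I}
 - \sum_{i \in I \in \mathcal{E}} u_{i|I} \log \frac{1}{|I|} .
\]

Since $A$ is $\mathcal{E}$-multigraded, the last two terms of $\frac{dD}{dt}$ are zero (i.e.
$(1, 1, \dots, 1) \in \mathbb{R}^{\|\mathcal{E}\|}$ is in the rowspace of $A$, and
$(1, 1, \dots, 1) \in \mathbb{R}^{|I|}$ is in the rowspace of each $A_I$). At $t = 0$, the
first order condition implies that

\[
\frac{dD}{dt}\Big|_{t=0} = \sum_{i \in I \in \mathcal{E}} u_{i|I} \log p_{i|I} = 0 .
\]

Grouping the sum by the sign of $u_{i|I}$ and changing to exponential notation,

\[
p^{u^+} = p^{u^-} \tag{5}
\]

as desired.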

Now suppose that $p^*$ lies on the boundary of $\prod_{I \in \mathcal{E}} \Delta_I$. If the
zeros of $p$ lie outside $\mathrm{supp}(u)$, the argument made above for $p^*$ in the
interior holds after extending $D$ with the limit $p \log(p) \to 0$ as $p \to 0$. If there
are zeros on both sides of (5), i.e. $p_{i|I} = 0 = p_{j|J}$ for indices
$i|I \in \mathrm{supp}(u^+)$ and $j|J \in \mathrm{supp}(u^-)$, then the relation holds with
$0 = 0$.

We may assume $p_{i|I} = 0$ for some index $i|I \in \mathrm{supp}(u^+)$ in considering the
two remaining cases. The first case has $p_{j|J} = 1$ for some index
$j|J \in \mathrm{supp}(u^+)$. Because of the multigrading of $A$, which requires for any
$J \in \mathcal{E}$ and $u \in \ker A$ that $\sum_{j \in J} u_{j|J} = 0$, it must be that
there exists $k|J \in \mathrm{supp}(u^-)$. Then since
$p \in \prod_{I \in \mathcal{E}} \Delta_I$, we have $p_{k|J} = 0$ and the relation (5) holds
as $0 = 0$.

The second case has $0 < p_{j|J} < 1$ for all $j|J \in \mathrm{supp}(u^+)$ and
$0 < p_{k|K} < 1$ for all $k|K \in \mathrm{supp}(u^-)$. Then for small $t$,
$p^* + tu \in \prod_{I \in \mathcal{E}} \Delta_I$. Then we have

\[
\frac{dD}{dt} = \sum_{\{i|I \,:\, p_{i|I} = 0\}} u_{i|I} \log(t u_{i|I})
 + \sum_{\{j|J \,:\, p_{j|J} \neq 0\}} u_{j|J} \log(p_{j|J} + t u_{j|J}). \tag{6}
\]

Then the first term on the right hand side of (6) approaches negative infinity as
$t \to 0$ while the second approaches a constant; this contradicts the optimality of $p^*$,
so this case cannot arise. □




Figure 7: Ambiguity arising from Minkowski sum of simplices: two points appearing in
the fiber over $b$ in $\prod_{I \in \mathcal{E}} \Delta_I$. For any point on the dotted line,
there is a point in the second simplex such that their sum is $b$. We choose among these
points by maximizing entropy in the conditional probability distribution. See Figure 8 for
the space of solutions.

We now give a couple of examples.

Figure 8: The space of conditional probability distributions is the blow-up of
$\mathbb{P}^2$ at the point $p_2 = p_3 = 0$ of Figure 2, intersected with a triangular
prism. In general and in higher dimensions, blow-ups are along the conditioned-upon faces.
$E$ has homogeneous coordinates $(p_{2|23} : p_{3|23})$ and the triangle has homogeneous
coordinates $(p_1 : p_2 : p_3)$.

Example 4.4. For the case $m = 3$ with $\mathcal{E} = \{12, 13, 23, 123\}$, the matrix $A$ is

\[
A = \begin{array}{c|ccccccccc}
    & p_1 & p_2 & p_3 & p_{1|12} & p_{2|12} & p_{1|13} & p_{3|13} & p_{2|23} & p_{3|23} \\ \hline
1   & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
2   & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
3   & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 \\
12  & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
13  & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
23  & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
123 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}
\]

The $V$-coordinate rows are labeled 1, 2, 3 and the $U$-coordinate rows are labeled
12, 13, 23, 123. The polytope $\mathrm{mconv}(A)$ is the permutohedron which is the convex
hull of the permutations of $(3, 1, 0)$, shown in Figure 9. Letting $A'$ be the last six
columns of $A$ (restriction to $\{I \subset [n] : |I| = 2\}$), $\mathrm{mconv}(A')$ is the
regular permutohedron $\mathrm{conv}((2, 1, 0), (2, 0, 1), (1, 0, 2), (1, 2, 0), (0, 2, 1),
(0, 1, 2))$, lifted with the last four coordinates all 1. This is illustrated in Figure 10.
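The matrix of Example 4.4 can be regenerated from the definition of the vertex-edge incidence matrix, and the $V$-coordinates of the Minkowski sum can be explored by adding one vertex from each simplex. The sketch below does both; the function names, the column ordering convention, and the row ordering by size are assumptions chosen so that the output matches the layout of the example.

```python
from itertools import permutations, product

def incidence_matrix(column_events, m):
    """Vertex-edge incidence matrix of the bipartite graph for the given events.

    Columns are the edges (i, I), grouped by event in the order given.
    Rows are the elements 1..m followed by the events sorted by size.
    """
    columns = [(i, I) for I in column_events for i in sorted(I)]
    row_events = sorted(column_events, key=lambda I: (len(I), sorted(I)))
    rows = list(range(1, m + 1)) + row_events
    # A row vertex lies on edge (i, I) exactly when it equals i or equals I.
    matrix = [[int(r == i or r == I) for (i, I) in columns] for r in rows]
    return rows, columns, matrix

def minkowski_vertex_sums(events, m):
    """All sums e_{k_1} + ... + e_{k_r} with k_s chosen from the s-th event."""
    sums = set()
    for choice in product(*[sorted(I) for I in events]):
        point = [0] * m
        for k in choice:
            point[k - 1] += 1
        sums.add(tuple(point))
    return sums

events = [frozenset(s) for s in ({1, 2, 3}, {1, 2}, {1, 3}, {2, 3})]
rows, columns, matrix = incidence_matrix(events, 3)
for row in matrix:
    print(row)

sums = minkowski_vertex_sums(events, 3)
print(set(permutations((3, 1, 0))) <= sums)  # True
```

Every candidate point has coordinate sum 4 and largest coordinate at most 3, consistent with the permutations of $(3, 1, 0)$ being the extreme points.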

The theorem of Matúš (Theorem 4.1) works in this way by projecting first from
$\mathcal{E} = \{I : |I| \ge 2\}$ to $\mathcal{E} = \{I : |I| = 2\}$ as in Figure 10. Thus the
result may be understood as saying that instead of all simplices, we can obtain a regular
permutohedron merely as the zonotope given by the Minkowski sum of the 1-simplices.

Figure 9: Multigraded convex hull of $A$ for $n = 3$ and
$\mathcal{E} = \{I \subset [n] : |I| \ge 2\}$. The last four coordinates, not shown, are
all 1.

Figure 10: Multigraded convex hull of $A$ for $n = 3$ and
$\mathcal{E} = \{I \subset [n] : |I| = 2\}$. The last four coordinates, not shown, are all 1.

5 Partially observed discrete random variables 

Let $X_1, \dots, X_n$ be discrete random variables with $X_i$ taking one of $d_i$ values. Then the
$m = \prod_{i=1}^{n} d_i$ singleton events in $\Omega$ are the elements of the Cartesian product of the sets
of states which each random variable may assume. For a subset of random variables
$X_{i_1}, \dots, X_{i_k}$ with $S := \{i_1, \dots, i_k\} \subseteq [n]$, we write $\Omega_S$ for the Cartesian product of the
states of this subset of the random variables. We also denote by $x|_S$ the restriction of
some global state $x \in \Omega$ to the states of the random variables in S. Then the set of events
$\mathcal{E}$ has the form:

$$\mathcal{E} = \bigl\{ \{x' \in \Omega : x'|_S = x_S\} \ \text{for some}\ S \subseteq [n],\ x_S \in \Omega_S \bigr\} \qquad (7)$$
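
To make the events in (7) concrete, here is a minimal sketch that enumerates the global states agreeing with a fixed partial state. The function name `event`, the string encoding of states, and the choice of four binary variables are assumptions made for illustration only.

```python
from itertools import product

def event(n, fixed, states=(0, 1)):
    """All global states x' in Omega whose restriction to S agrees with `fixed`.

    `fixed` maps a 1-based variable index in S to its required state.
    """
    return {
        "".join(map(str, x))
        for x in product(states, repeat=n)
        if all(x[i - 1] == s for i, s in fixed.items())
    }

# Hypothetical instance: four binary variables, S = {1, 3}, x_1 = 0, x_3 = 1.
print(sorted(event(4, {1: 0, 3: 1})))
```

Fixing two of four binary variables leaves four global states, so the event is a 2-dimensional face of the 4-cube of all sixteen states.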

Let $E(x_S)$ denote the event which is the union of all singleton events with random
variables S in state $x_S$. For example, let n = 4, $d_i = 2$ with states denoted 0 and 1, and
S = {1, 3}. Then the event with $x_1 = 0$ and $x_3 = 1$ is {0010, 0011, 0110, 0111}, which
corresponds to a 2-face of the 4-cube. Now we may write with the more usual notation

$$p_{x_A | x_B} := p_{E(x_A) \cap E(x_B) \,|\, E(x_B)},$$

which is convenient for considering, say, the conditional probability of having a disease
given a positive test result. Besag's relation (1) among positive conditional probabilities
is written this way:

$$\frac{P(x)}{P(y)} = \prod_{i=1}^{n} \frac{P(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_n)}{P(y_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_n)}. \qquad (8)$$

This is a special case of the relations derived in Theorem 3.2, as we now explain.

Denote the event $(x_1, \dots, x_{j-1}, y_j, \dots, y_n)$ by j, so the singleton events are
$(y_1, \dots, y_n) = 1, 2, \dots, n + 1 = (x_1, \dots, x_n)$. The set $\mathcal{E}$ consists of the event
$\{1, \dots, n + 1\}$ together with the events $\{j, j + 1\}$ for $j = 1, \dots, n$. Then the
cleared-denominator version of (1) is the outer cycle
$[n+1] \to 1 \leftarrow 12 \to 2 \leftarrow \cdots \leftarrow \{n, n+1\} \to n+1 \leftarrow [n+1]$ in the graph
$G_{\mathcal{E}}$. For example, with three variables we have events $1 = (y_1, y_2, y_3)$, $2 = (x_1, y_2, y_3)$,
$3 = (x_1, x_2, y_3)$, and $4 = (x_1, x_2, x_3)$. The relation (1) is

$$\frac{p_4}{p_1} = \frac{p_{2|12}\, p_{3|23}\, p_{4|34}}{p_{1|12}\, p_{2|23}\, p_{3|34}},$$

corresponding to the cycle binomial

$$p_1\, p_{2|12}\, p_{3|23}\, p_{4|34} - p_4\, p_{1|12}\, p_{2|23}\, p_{3|34},$$

which is $f_C$ for the outer cycle C of the graph in Figure 11.
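
As a sanity check, the cycle binomial can be evaluated on conditional probabilities computed from an actual joint distribution. The sketch below uses exact rational arithmetic. The weights are an arbitrary hypothetical choice and the helper `cond` is ours, not notation from the paper.

```python
from fractions import Fraction

def cond(p, i, event):
    """Conditional probability p_{i|event} for a singleton i contained in `event`."""
    return p[i] / sum(p[j] for j in event)

# Hypothetical positive joint distribution on the singleton events 1, 2, 3, 4.
weights = {1: 2, 2: 3, 3: 5, 4: 7}
total = sum(weights.values())
p = {i: Fraction(w, total) for i, w in weights.items()}

lhs = p[1] * cond(p, 2, (1, 2)) * cond(p, 3, (2, 3)) * cond(p, 4, (3, 4))
rhs = p[4] * cond(p, 1, (1, 2)) * cond(p, 2, (2, 3)) * cond(p, 3, (3, 4))
print(lhs - rhs)  # value of the cycle binomial at this distribution
```

Both monomials equal $p_1 p_2 p_3 p_4$ divided by the same product of pairwise sums, so the difference is zero for any positive distribution, not only for these weights.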

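[Figure 11 image: bipartite graph on singleton events and events in $\mathcal{E}$, with edges labeled by
conditional probabilities such as $p_{1|12}$, $p_{2|12}$, $p_{3|34}$, $p_{4|34}$.]

Figure 11: Bipartite graph for $\mathcal{E} = \{\{1, 2\}, \{2, 3\}, \{3, 4\}, \{1, 2, 3, 4\}\}$.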



6 Bayes' rule 

Because of the Bayes binomials, on points which are projective conditional probability
distributions, we have, with $i, j \in J \subseteq K \subseteq [m]$,

$$p_{i|K}\, p_{j|J} = p_{j|K}\, p_{i|J}.$$

This implies, by summing over $j \in J$, that

$$p_{i|K}\, p_{J|J} = p_{J|K}\, p_{i|J}. \qquad (9)$$

Using two copies of (9) with different intermediate sets $J_1$ and $J_2$, we have

$$(p_{i|J_1}\, p_{J_1|K})\, p_{J_2|J_2} = p_{i|K}\, p_{J_1|J_1}\, p_{J_2|J_2} = (p_{i|J_2}\, p_{J_2|K})\, p_{J_1|J_1},$$

which gives a multihomogeneous version of Bayes' rule. Because we consider the point
representing a projective conditional probability distribution as a tuple of homogeneous
coordinates $((p_{i_1|I} : \cdots : p_{i_{|I|}|I}),\ I \in \mathcal{E})$, we may set $p_{J_1|J_1}$ and $p_{J_2|J_2}$ to 1 on an open
set containing all probabilistically relevant points, and summing over $i \in I$, this becomes

$$p_{I|J_1}\, p_{J_1|K} = p_{I|J_2}\, p_{J_2|K}.$$

Or, when $p_{J_1|K} \neq 0$,

$$p_{I|J_1} = \frac{p_{I|J_2}\, p_{J_2|K}}{p_{J_1|K}},$$

so that in particular, with $A, B \subseteq [m]$, and setting $I = A \cap B$, $J_1 = B$, $J_2 = A$, and
$K = [m]$, we have the familiar expression for Bayes' rule

$$p_{A \cap B | B} = \frac{p_{A \cap B | A}\, p_A}{p_B}.$$
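
The two forms of Bayes' rule above can be checked numerically. This is a minimal sketch with exact fractions. The distribution on five singleton events and the sets A and B are hypothetical choices, and `cond` is an illustrative helper rather than notation from the paper.

```python
from fractions import Fraction

def cond(p, I, J):
    """p_{I|J}: probability of the event I given the event J, for I contained in J."""
    assert set(I).issubset(J)
    return sum(p[i] for i in I) / sum(p[j] for j in J)

# Hypothetical distribution on [m] = {1, ..., 5}; the weights sum to 20.
p = {i: Fraction(w, 20) for i, w in {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}.items()}
K = (1, 2, 3, 4, 5)
A, B = (1, 2, 3), (2, 3, 4)
I = tuple(sorted(set(A) & set(B)))

# Multihomogeneous form with J1 = B and J2 = A.
left = cond(p, I, B) * cond(p, B, K)
right = cond(p, I, A) * cond(p, A, K)

# Familiar form: p_{A cap B | B} = p_{A cap B | A} * p_A / p_B.
bayes = cond(p, I, A) * cond(p, A, K) / cond(p, B, K)
print(left, right, bayes)
```

Both `left` and `right` reduce to the probability of $A \cap B$ given K, which is why the identity holds regardless of the intermediate set.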

7 Appendix: Toric ideals and toric varieties 

Here we collect some needed facts about toric ideals and toric varieties based primarily
on Sturmfels' book [23], also referring to [6], [8], [10], [18], [19].



7.1 Affine toric varieties 

Let A be a $d \times m$ integer matrix, with columns $a_1, \dots, a_m$. Let $\mathbb{C}[x_1, \dots, x_m]$ be a
polynomial ring in m variables, and for $u \in \mathbb{Z}^m$ let $x^u = \prod_{i=1}^{m} x_i^{u_i}$. The matrix A defines
a toric ideal

$$I_A = \langle x^{u^+} - x^{u^-} : u \in \ker A \cap \mathbb{Z}^m \rangle,$$

where $u^+$ is the positive part of u and $u^-$ the negative. The toric ideal $I_A$ is a prime
ideal. A minimal set of binomials which generates $I_A$ is said to be a Markov basis for
the matrix A. A term order is a total order $\succ$ on the monomials of a polynomial ring
such that 1 is the unique minimal element and $m_1 \succ m_2$ implies $m_3 m_1 \succ m_3 m_2$ for any
monomials $m_1, m_2, m_3$. This order defines the initial monomial of any polynomial, and
the initial ideal of an ideal I is generated by the initial monomials $\mathrm{in}_\succ(f)$ for all $f \in I$. A
Gröbner basis $\{f_1, \dots, f_k\}$ for an ideal I with respect to a monomial term order $\succ$ has
$\mathrm{in}_\succ(I) = \langle \mathrm{in}_\succ(f_1), \dots, \mathrm{in}_\succ(f_k) \rangle$. A Gröbner basis is universal if it is a Gröbner basis for
all term orders $\succ$. For polynomials f and g and term order $\succ$, let $m(f, g)$ be the least
common multiple of their leading monomials, and let $f_0$, $g_0$ be their leading terms. Then
their S-polynomial is $\frac{m(f,g)}{f_0} f - \frac{m(f,g)}{g_0} g$ and is used in Buchberger's algorithm.
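
The S-polynomial defined above is easy to compute by hand for binomials, and a small sketch may help fix the definition. Polynomials are stored as dictionaries from exponent tuples to coefficients, and the term order is assumed to be lexicographic with $x \succ y \succ z$ (which coincides with Python's tuple comparison). The representation, the function names, and the two example binomials are our own illustrative choices.

```python
from fractions import Fraction

def leading(poly):
    """Leading (exponent, coefficient) under lex order; tuples compare lexicographically."""
    exp = max(poly)
    return exp, poly[exp]

def s_polynomial(f, g):
    """S(f, g) = (m/f0) f - (m/g0) g, with m the lcm of the leading monomials."""
    (ef, cf), (eg, cg) = leading(f), leading(g)
    lcm = tuple(max(a, b) for a, b in zip(ef, eg))
    result = {}
    for poly, lead_exp, lead_coeff, sign in ((f, ef, cf, 1), (g, eg, cg, -1)):
        shift = tuple(l - e for l, e in zip(lcm, lead_exp))
        for exp, coeff in poly.items():
            key = tuple(a + b for a, b in zip(exp, shift))
            result[key] = result.get(key, 0) + sign * Fraction(coeff) / lead_coeff
    # Drop cancelled terms (in particular the two leading terms).
    return {e: c for e, c in result.items() if c != 0}

# Hypothetical binomials in C[x, y, z]:
f = {(2, 0, 0): 1, (0, 1, 0): -1}  # x^2 - y
g = {(1, 1, 0): 1, (0, 0, 1): -1}  # x*y - z
print(s_polynomial(f, g))          # y*f - x*g = x*z - y^2
```

The leading terms cancel by construction, which is the point of the S-polynomial: Buchberger's algorithm reduces what remains against the current basis.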

In the affine space $\mathbb{C}^m$ with coordinates $x_1, \dots, x_m$, the ideal $I_A$ cuts out the affine
toric variety $X_A$. The $\mathbb{R}_{\geq 0}$-span of the columns of A defines a cone pos(A), and the $\mathbb{N}$-span
defines a semigroup $\mathbb{N}A$. The corresponding semigroup ring $\mathbb{C}[\mathbb{N}A]$ is isomorphic to the
affine coordinate ring $\mathbb{C}[x_1, \dots, x_m]/I_A$, i.e. $X_A = \mathrm{Spec}(\mathbb{C}[x_1, \dots, x_m]/I_A) = \mathrm{Spec}\, \mathbb{C}[\mathbb{N}A]$.
Such varieties are not always normal. The matrix A defines a map
$f_A : \theta \mapsto \theta^A := (\theta^{a_1}, \dots, \theta^{a_m})$ from the d-dimensional torus to the toric variety $X_A$.
This gives an explicit torus action and torus embedding. The closure of the image of $f_A$
is $X_A$. This is also the parameterization map of an exponential family.
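
The relationship between the monomial map and the toric ideal can be seen in a small computation: every binomial $x^{u^+} - x^{u^-}$ with u in the integer kernel of A vanishes on the image of the map. The sketch below uses a hypothetical $2 \times 4$ matrix (the twisted cubic) that does not appear in the paper, and the function names are illustrative.

```python
from fractions import Fraction

def monomial_map(theta, A):
    """f_A(theta) = (theta^{a_1}, ..., theta^{a_m}) for the columns a_j of A."""
    d, m = len(A), len(A[0])
    point = []
    for j in range(m):
        value = Fraction(1)
        for i in range(d):
            value *= Fraction(theta[i]) ** A[i][j]
        point.append(value)
    return point

def binomial(x, u):
    """Evaluate x^{u+} - x^{u-} at the point x."""
    plus = minus = Fraction(1)
    for xj, uj in zip(x, u):
        if uj > 0:
            plus *= xj ** uj
        else:
            minus *= xj ** (-uj)
    return plus - minus

# Hypothetical matrix and kernel vector u (so that A u = 0).
A = [[3, 2, 1, 0],
     [0, 1, 2, 3]]
u = [1, -2, 1, 0]
assert all(sum(row[j] * u[j] for j in range(4)) == 0 for row in A)

x = monomial_map([Fraction(2, 3), Fraction(5, 7)], A)
print(binomial(x, u))  # x1*x3 - x2^2 evaluated on the image of f_A
```

The binomial evaluates to zero for any choice of torus point, since the exponent of each parameter in $x^{u}$ is a coordinate of Au.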

7.2 Polytopes and projective toric varieties 

Let conv(A) be the convex hull of the columns of A. This is a polytope. Let $Y_A$ be
the projective toric variety defined by taking the closure of the image of $f_A$, and viewing
$x_1, \dots, x_m$ as homogeneous coordinates. The corresponding homogeneous toric ideal is
the ideal

$$J_A = \langle x^{u^+} - x^{u^-} : u \in \ker A \cap \mathbb{Z}^m,\ \|u^+\|_1 = \|u^-\|_1 \rangle. \qquad (10)$$

The affine cone over $Y_A$ is the toric variety $X_{A'}$, where A' is A with a row of ones added
at the bottom unless the vector of all ones already lies in rowspan(A). This induces
homogeneity with respect to the $\mathbb{Z}$-grading. When A has (1, 1, ..., 1) in its row span (e.g.
by having equal column sums or (1, 1, ..., 1) as a row), we say it is $\mathbb{Z}$-graded and the norm
restriction in (10) is not required. Instead of (1, 1, ..., 1), we can use another grading of
the columns of A to obtain multihomogeneous ideals.
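
To illustrate why the norm restriction in (10) becomes automatic for a graded matrix, the following sketch checks equal column sums and the balance of a kernel vector. The matrix is our reading of the one in Example 4.4 (rows 1, 2, 3, 12, 13, 23, 123 and columns $p_1, p_2, p_3, p_{1|12}, p_{2|12}, p_{1|13}, p_{3|13}, p_{2|23}, p_{3|23}$), which should be treated as an assumption, and the helper names are illustrative.

```python
def column_sums(A):
    """Sum of the entries in each column of A."""
    return [sum(row[j] for row in A) for j in range(len(A[0]))]

def l1_parts(u):
    """Return (||u+||_1, ||u-||_1) for an integer vector u."""
    return sum(max(v, 0) for v in u), sum(max(-v, 0) for v in u)

# Assumed matrix of Example 4.4.
A = [
    [1, 0, 0, 1, 0, 1, 0, 0, 0],  # row 1
    [0, 1, 0, 0, 1, 0, 0, 1, 0],  # row 2
    [0, 0, 1, 0, 0, 0, 1, 0, 1],  # row 3
    [0, 0, 0, 1, 1, 0, 0, 0, 0],  # row 12
    [0, 0, 0, 0, 0, 1, 1, 0, 0],  # row 13
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # row 23
    [1, 1, 1, 0, 0, 0, 0, 0, 0],  # row 123
]

# Kernel vector of the Bayes-type binomial p1 * p2|12 - p2 * p1|12.
u = [1, -1, 0, -1, 1, 0, 0, 0, 0]

print(column_sums(A))                                        # equal sums: A is Z-graded
print([sum(a * v for a, v in zip(row, u)) for row in A])     # A u = 0
print(l1_parts(u))                                           # the two parts balance
```

Equal column sums put (1, ..., 1) in the row span, so any u with Au = 0 has coordinates summing to zero, which is exactly $\|u^+\|_1 = \|u^-\|_1$.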

7.3 The moment map 

The moment map sends a projective toric variety $Y_A$ onto its polytope conv(A), bijectively
on the nonnegative part of the variety. Theorem 4.3 is a version of this result for toric
varieties in a product of projective spaces.

Theorem 7.1. Let A be a $d \times m$, $\mathbb{Z}$-graded matrix, and $Y_A$ the corresponding projective
toric variety. Then the map

$$\mu : Y_A \to \mathrm{conv}(A), \quad \text{given by} \quad \mu(y) = \frac{1}{Z(y)} \sum_{j=1}^{m} |y_j|\, a_j,$$

where $Z(y) = \sum_j |y_j|$, is a bijection from $Y_{A, \geq 0}$ onto conv(A). If further rank(A) = d,
with $f_A$ the torus embedding, then $\mu \circ f_A$ is a homeomorphism $\mathbb{R}^d_{>0} \to \mathrm{conv}(A)^\circ$.

The result is standard and a proof can be found in [23], [10], [8]; it goes by the name
Birch's theorem in statistics.
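
A small numerical illustration of the moment map may be useful. The sketch assumes the standard form $\mu(y) = Z(y)^{-1} \sum_j |y_j| a_j$ with $Z(y) = \sum_j |y_j|$, which is our reading of the theorem, and uses a hypothetical $2 \times 3$ matrix with equal column sums. The function name is illustrative.

```python
from fractions import Fraction

def moment_map(y, A):
    """mu(y) = (1/Z(y)) * sum_j |y_j| a_j with Z(y) = sum_j |y_j| (assumed form)."""
    weights = [abs(Fraction(v)) for v in y]
    Z = sum(weights)
    return [sum(w * a for w, a in zip(weights, row)) / Z for row in A]

# Hypothetical Z-graded matrix; its columns are the points (2,0), (1,1), (0,2),
# so conv(A) is the segment between (2,0) and (0,2).
A = [[2, 1, 0],
     [0, 1, 2]]

theta = (Fraction(1, 2), Fraction(3))
y = [theta[0] ** A[0][j] * theta[1] ** A[1][j] for j in range(3)]  # f_A(theta)
mu = moment_map(y, A)
print(mu)
```

Because $\mu(y)$ is a convex combination of the columns of A with strictly positive weights, the image of a positive torus point lies in the relative interior of conv(A); here its coordinates are positive and sum to the common column sum 2.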